Perceiver: General Perception with Iterative Attention - 🍣YuWd(和田唯我)'s notes🍣

Perceiver: General Perception with Iterative Attention

https://gyazo.com/9eb83752f74bf34478003e1553273308

Improves on the Transformer

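By making the query Q a latent variable, attention is freed from the curse of $L^2$: with $N$ latents and an input of length $L$, the attention map is $N \times L$ instead of $L \times L$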
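To make the saving concrete, here is a small back-of-the-envelope sketch that counts query-key dot products per attention layer. The sizes (a 224x224 image as input, 512 latents) and the helper name `attention_score_count` are illustrative assumptions, not values taken from these notes.

```python
def attention_score_count(num_queries: int, num_keys: int) -> int:
    """Number of query-key dot products computed by one attention layer."""
    return num_queries * num_keys


L = 224 * 224  # input length: pixels of a 224x224 image (illustrative assumption)
N = 512        # number of latents (illustrative assumption)

self_attn = attention_score_count(L, L)   # Q comes from the input: L x L scores
cross_attn = attention_score_count(N, L)  # Q comes from the latents: N x L scores

print(self_attn, cross_attn, self_attn // cross_attn)
```

With these sizes the ratio is simply L / N, so the cost of reading the input grows linearly with L once N is fixed.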

Also well suited to audio and time-series prediction

With the latent variables acting as centroids, it can also be viewed as clustering the high-dimensional input $x$ end-to-end

In other words, the picture is that the input $x$ is being tagged (as the paper itself puts it)
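The clustering view can be sketched with a dependency-free toy cross-attention. This is only an illustration under simplifying assumptions: there are no learned Q/K/V projections, the data is made up, and reading the strongest-attending latent as a "tag" is one informal interpretation, not the paper's procedure. The names `softmax` and `cross_attend` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Single-head cross-attention without learned projections (assumption).

    latents: N query vectors, inputs: L key/value vectors of the same dimension.
    Returns (weights, updated): weights is N x L, updated is N x D.
    """
    dim = len(inputs[0])
    weights, updated = [], []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in inputs]
        w = softmax(scores)  # normalised over the L inputs
        weights.append(w)
        # each latent becomes a weighted average of the inputs it attends to
        updated.append([sum(wj * x[d] for wj, x in zip(w, inputs)) for d in range(dim)])
    return weights, updated


# Toy data (made up): two latents, four 2-D inputs forming two groups.
latents = [[4.0, 0.0], [0.0, 4.0]]
inputs = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]

weights, updated = cross_attend(latents, inputs)
# For each input, the latent that attends to it most strongly plays the role of its tag.
tags = [max(range(len(latents)), key=lambda n: weights[n][j]) for j in range(len(inputs))]
print(tags)
```

Each latent ends up as a weighted mean of the inputs closest to it, which is what makes the centroid analogy natural.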

Positional Encoding

Instead of the usual PE, Fourier features are used

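code:main.py
from math import pi
import torch
def fourier_encode(x, max_freq, num_bands = 4):
    x = x.unsqueeze(-1)  # trailing axis for the frequency bands
    device, dtype, orig_x = x.device, x.dtype, x
    # num_bands frequencies spaced equally between 1 and max_freq / 2
    scales = torch.linspace(1., max_freq / 2, num_bands, device = device, dtype = dtype)
    scales = scales[(*((None,) * (len(x.shape) - 1)), Ellipsis)]  # shape (1, ..., 1, num_bands)
    x = x * scales * pi
    x = torch.cat([x.sin(), x.cos()], dim = -1)  # sin and cos features for each band
    x = torch.cat((x, orig_x), dim = -1)  # append the raw position
    return x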

https://github.com/lucidrains/perceiver-pytorch/blob/main/perceiver_pytorch/perceiver_pytorch.py#L33

Apparently closely related to NeRF

According to Rahaman et al., neural networks tend to learn low-frequency components (On the Spectral Bias of Neural Networks (ICML 2019))

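So, if the input is first mapped into a higher-dimensional space using high-frequency components, high-frequency content reportedly becomes easier to learn (per NeRF)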

We use a parameterization of Fourier features that allows us to (i) directly represent the position structure of the input data (preserving 1D temporal or 2D spatial structure for audio or images, respectively, or 3D spatiotemporal structure for videos), (ii) control the number of frequency bands in our position encoding independently of the cutoff frequency, and (iii) uniformly sample all frequencies up to a target resolution. We parametrize the frequency encoding to take the values $[\sin(f_k \pi x_d), \cos(f_k \pi x_d)]$, where the frequency $f_k$ is the $k$-th band of a bank of frequencies spaced equally between 1 and $\frac{\mu}{2}$. $\frac{\mu}{2}$ can be naturally interpreted as the Nyquist frequency (Nyquist, 1928) corresponding to a target sampling rate of $\mu$.
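As a worked example of the quoted parameterization, the sketch below builds the frequency bank and encodes a single coordinate using only the standard library. The function names `frequency_bank` and `fourier_features`, the sample values, and the choice to append the raw coordinate at the end (mirroring the PyTorch snippet in these notes) are assumptions made for illustration.

```python
import math


def frequency_bank(max_freq, num_bands):
    """num_bands frequencies spaced equally between 1 and max_freq / 2."""
    if num_bands == 1:
        return [1.0]
    step = (max_freq / 2 - 1.0) / (num_bands - 1)
    return [1.0 + k * step for k in range(num_bands)]


def fourier_features(x_d, max_freq, num_bands):
    """Return [sin(f_k*pi*x_d) for all k] + [cos(f_k*pi*x_d) for all k] + [x_d]."""
    freqs = frequency_bank(max_freq, num_bands)
    sins = [math.sin(f * math.pi * x_d) for f in freqs]
    coss = [math.cos(f * math.pi * x_d) for f in freqs]
    return sins + coss + [x_d]


# Sample values (assumed): target sampling rate mu = 10, so the bank tops out at mu / 2 = 5.
bank = frequency_bank(max_freq=10.0, num_bands=4)
feats = fourier_features(0.5, max_freq=10.0, num_bands=4)
print(bank, len(feats))
```

Each coordinate therefore contributes 2 * num_bands + 1 features, and the number of bands can be changed without moving the cutoff max_freq / 2, which is point (ii) of the quote.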

Addendum

Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains is worth reading

https://github.com/lucidrains/perceiver-pytorch/blob/main/perceiver_pytorch/perceiver_pytorch.py#L227